Abstract



We describe a new method for visualizing topics, the distributions over terms that are automatically 
extracted from large text corpora using latent variable models. Our method finds significant n-grams related 
to a topic, which are then used to help understand and interpret the underlying distribution. Compared with 
the usual visualization, which simply lists the most probable topical terms, the multi-word expressions 
provide a better intuitive impression for what a topic is "about." Our approach is based on a language model 
of arbitrary length expressions, for which we develop a new methodology based on nested permutation 
tests to find significant phrases. We show that this method outperforms the more standard use of $\chi^2$ and
likelihood ratio tests. We illustrate the topic presentations on corpora of scientific abstracts and news 
articles. 

1 Introduction 

Topic models are hierarchical Bayesian models of document collections that explain an observed corpus with 
a small set of distributions over terms. When fit to a corpus, these distributions tend to correspond to intuitive 
notions of the topics or themes that pervade the documents. Topic models have emerged as a powerful tool 
for unsupervised analysis of text. They have been extended for authorship [19], citation [10], and discourse
segmentation [18]. Review articles of topic modeling provide further applications [21].

The idea behind topic modeling is to imagine a probabilistic process by which both a hidden thematic structure 
and observed collection of documents arises. Given the observed collection, one then "reverses" this process 
to determine the posterior distribution of the hidden thematic structure. Topic models build on and were 
inspired by techniques like latent semantic analysis (LSA) [2] and probabilistic latent semantic analysis 
(pLSA) [7]. However, LSA and pLSA do not embody generative probabilistic processes. By adopting a fully
generative model, topic models such as latent Dirichlet allocation exhibit better generalization and are easily
extendable [1].

Once they are fit to a corpus, it is of interest to visualize the topics. These visualizations provide landmark
descriptive statistics for understanding, exploring, and navigating through an otherwise unorganized collection
of documents [11]. Typically, one visualizes each topic by simply listing the terms in order of decreasing
probability. While a person can usually peruse these lists and intuit "meanings" of the topics, such visualizations
can be unsatisfying. Single terms are often part of indicative phrases, which are lost in a simple
unigram representation. An alternative is to fit a more complicated model [4, 24, 25], but then one loses the
computational advantage and statistical simplicity of unigram topic modeling.

[Figure 1, left panel: "Annotated documents," an excerpt of text about phase transitions in which each word carries the number of its assigned topic as a subscript.]

LDA topic #11

phase, transitions, phases, transition, quantum,
critical, symmetry, field, point, model, order,
diagram, systems, two, theory, system, study,
breaking, spin, first

Turbo topic #11

phase transitions, model, symmetry, point,
quantum, systems, phase transition, phase
diagram, system, order, field, order, parameter,
critical, two, transitions in, models, different,
symmetry breaking, first order, phenomena

Figure 1: An illustration of the turbo topics strategy. We first estimate an LDA topic model (under the word
exchangeability assumption). We next annotate each word in the original corpus with its most likely posterior
topic. This is illustrated at left in the subscript on each word and with topic 11 highlighted in yellow. We run
a hypothesis testing procedure over the annotated corpus to identify significant words that appear to the left or
right of a word or phrase labeled with a given topic. This procedure is carried out recursively, until no more
significant phrases are found. At right we illustrate the original top words from topic 11, and those found by the
turbo topics strategy. Phrases like "phase diagram," "symmetry breaking," and "first order" are found by the
procedure. More topics are illustrated in Figure 3.

In this paper we introduce a new method for visualizing unigram topic models. In our approach, the model is 
first fit as usual, and then the posterior distribution is used to annotate each word occurrence of the corpus 
with its most probable topic. With this annotated corpus, we carry out a statistical co-occurrence analysis to 
extract the most significant n-grams for each topic. The resulting multi-term phrases are combined with the 
unigram lists to give a visualization that offers a better intuitive impression for what a topic is about. We call 
the resulting visualizations turbo topics, as suggested by the manner in which the method recycles the output 
of estimation to build a more powerful presentation of the model. 

As part of this procedure, we developed a new algorithm for finding multi-word expressions. Our method 
uses a back-off language model defined for arbitrary length expressions and recursively employs the
distribution-free permutation test to find significant phrases. In contrast, previous methods of finding multi-word
expressions rely on a test statistic derived from a multinomial contingency table and, in most cases,
appeal to the asymptotic distribution of that statistic [9]. We show that the permutation test works better in
small sample settings, such as when we restrict our attention to topical terms, and the back-off model allows 
for finding multi-word phrases within a well-defined language model. 

We describe turbo topics in Section 2 and our new method of finding multi-word expressions in Section 3.
In Section 4 we evaluate on simulated data and illustrate improved topic visualization with two real-world
corpora.
2 Turbo Topics



We first review latent Dirichlet allocation (LDA), a commonly used building block for topic models [1]. We
then describe our algorithm for finding multi-word expressions to visualize topics.

Latent Dirichlet allocation. LDA models documents as arising from multiple topics, where a topic is 
defined to be a distribution over a fixed vocabulary of terms [1]. Specifically, LDA assumes that K topics are
associated with a collection, and that each document exhibits these topics with different proportions. This is 
often a natural assumption because documents tend to be heterogeneous. Each might combine a subset of 
themes that permeate the collection as a whole. 

LDA is a hidden variable model where the observed data are the words of each document and the hidden 
variables are the latent topical structure, i.e., the topics themselves and how each document exhibits them. 
Given a collection, the posterior distribution of the hidden variables given the words of the documents 
provides a probabilistic decomposition of the documents into topics. 

The statistical assumptions underlying LDA can be understood by its probabilistic generative process, the 
random process that is assumed to have produced the observed data. Let $K$ be a specified number of topics, $V$
the size of the vocabulary, $\alpha$ a positive $K$-vector, and $\eta$ a scalar. LDA assumes that the collection arises as
follows.

For each of the $K$ topics draw a distribution over words,

$$\beta_k \sim \mathrm{Dirichlet}_V(\eta),$$

where $\mathrm{Dirichlet}_V$ denotes the Dirichlet distribution on the $V - 1$ dimensional probability simplex. For each
document $d$ draw a vector of topic proportions

$$\theta_d \sim \mathrm{Dirichlet}_K(\alpha).$$

Finally, to select the $i$th word in the document, first draw a topic assignment from the topic proportions,

$$z_{d,i} \mid \theta_d \sim \mathrm{Multinomial}(\theta_d),$$

and then draw the word from the chosen topic

$$w_{d,i} \mid z_{d,i} \sim \mathrm{Multinomial}(\beta_{z_{d,i}}).$$

This process specifies how the latent variables interact to produce the observed collection. 
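
The generative process above can be simulated directly. The following is a minimal sketch, assuming NumPy is available; the function name `sample_lda_corpus`, the fixed document length, and the parameter values in the usage lines are illustrative choices and are not part of the model description.

```python
import numpy as np

def sample_lda_corpus(num_docs, doc_length, K, V, alpha, eta, seed=0):
    """Draw a synthetic corpus from the LDA generative process.

    alpha is a positive K-vector and eta a positive scalar, as in the text.
    A fixed doc_length is an assumption made only to keep the sketch short.
    """
    rng = np.random.default_rng(seed)
    # One distribution over the V vocabulary terms for each of the K topics.
    beta = rng.dirichlet(np.full(V, eta), size=K)        # shape (K, V)
    docs, assignments = [], []
    for _ in range(num_docs):
        theta = rng.dirichlet(alpha)                     # topic proportions
        z = rng.choice(K, size=doc_length, p=theta)      # per-word topic assignments
        w = np.array([rng.choice(V, p=beta[k]) for k in z])  # words given topics
        docs.append(w)
        assignments.append(z)
    return beta, docs, assignments

beta, docs, z = sample_lda_corpus(
    num_docs=5, doc_length=20, K=3, V=50, alpha=np.full(3, 0.5), eta=0.5
)
```

Posterior inference runs this process in reverse: only `docs` is observed, and `beta`, the topic proportions, and `z` must be inferred.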

Finding the posterior distribution of the hidden variables is akin to "reversing" this process given a corpus
$\{w_d\}_{d=1}^{D}$. The posterior $p(\beta_{1:K}, \theta_{1:D}, z_{1:D} \mid w_{1:D}, \alpha, \eta)$ provides a probabilistic decomposition of the corpus
into its topics $\beta_{1:K}$. Each document exhibits multiple topics via its topic proportions $\theta_d$; the words of each
document are assigned to specific topics $z_{d,i}$.

However, this posterior is intractable to compute; the central computational problem in topic modeling is to 
approximate it. Efficient approximation algorithms include Markov chain Monte Carlo sampling [21] and
variational methods [13, 1, 23]. Here we use mean-field variational inference [1], though our methods can be
used with any algorithm for approximate inference.
Multi-word expressions for topic visualization. The topics in LDA are unigram distributions over words,
and the LDA model is exchangeable at the word level. This means that if the words in the corpus were 
shuffled within their documents then exactly the same inferences would result. Arguably, this is a reasonable 
assumption for topic modeling. When presented with a jumbled document, a human can often discern 
the thematic content of the text, even if he or she cannot reconstruct the detailed flow of the presentation. 
Exchangeable word models like LDA are simpler and offer computational advantages over more complex 
models that take word order into account [4, 24, 25].

Our goal is to enhance the interpretation of the model rather than the model itself, preserving the advantages 
of exchangeable modeling while attaining the expressive visualizations of n-gram modeling. Thus, our focus
is on analyzing the posterior distribution of the topic structure of a corpus to determine phrases indicative of
each of the topics. To do so, we first use the posterior of the topic variables $z_{d,i}$ to assign topics to words.
Then, based on the original order of the words in the documents, we use the annotated words to find the
significant topical n-grams that stem from them.

Our strategy is as follows. 

1. Estimate an LDA topic model with K topics. This results in a posterior for topics, topic proportions, 
and per-word topic assignments. 

2. Using the posterior, annotate each word in the original corpus with a topic assignment. This yields an 
ordered sequence of word-topic pairs 

$$(w_1, z_1), (w_2, z_2), (w_3, z_3), (w_4, z_4), \ldots$$

3. Given a word or phrase w and topic z of interest, run a hypothesis testing procedure to identify words v 
that are likely to precede or follow w when it is labeled as belonging to topic z.

4. Repeat step 3 until no significant phrases are added. 
This is illustrated in Figure 1.

In step 2, the sequence $w_1, w_2, w_3, \ldots$ denotes the original text that comprises the corpus. This step annotates
each word $w_i$ in the text with a topic assignment $z_i$, unless the word was removed by pre-processing (e.g., a
stop word). The topic assigned to the word is determined by the posterior, and is document specific. Thus,
the word "fly" in one document might be annotated by a topic about insects; the same word in another document
might be annotated by a topic about airplanes. Topic models capture polysemy, in the sense that they can
assign the same term to different topics in different document contexts [21]. Once the words are annotated
with topic assignments, document boundaries are ignored.
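
The annotation in step 2 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: `phis` is a hypothetical list holding, for each document, one row of approximate posterior topic probabilities per in-vocabulary word, and words outside `vocab` (such as stop words) receive no topic.

```python
import numpy as np

def annotate_corpus(docs, phis, vocab):
    """Pair each word with its most probable topic under the approximate posterior.

    docs:  list of documents, each a list of word strings in original order
    phis:  list of arrays; phis[d][j] holds the topic probabilities of the
           j-th in-vocabulary word of document d (hypothetical layout)
    vocab: set of words that were kept for the topic analysis
    """
    annotated = []
    for words, phi in zip(docs, phis):
        phi_rows = iter(phi)
        for word in words:
            if word in vocab:
                topic = int(np.argmax(next(phi_rows)))
            else:
                topic = None  # removed by pre-processing, e.g. a stop word
            annotated.append((word, topic))
    # Document boundaries are dropped: the result is one flat sequence.
    return annotated

docs = [["the", "phase", "transition"], ["quantum", "phase"]]
vocab = {"phase", "transition", "quantum"}
phis = [np.array([[0.1, 0.9], [0.2, 0.8]]), np.array([[0.7, 0.3], [0.4, 0.6]])]
annotated = annotate_corpus(docs, phis, vocab)
```

Because the posterior is document specific, the same word may receive different topics in different documents.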

Step 3 results in bigrams (w, v) or (v, w) for a given topic. Note that v may or may not have been assigned topic
z. For instance, consider a topic focused on movies, and consider the movie title "Sex in the City." The terms
"in" and "the" are stop words, which are not assigned to any topic; the term "city" may not be very relevant
to the movies topic. However, if that topic assigns high probability to the term "sex," then our method will be
able to identify the movie title because of the repeated context in which it appears. (See Figure 3.)

3 Recursive Permutation Tests For Multi-Word Expressions 

A key component in turbo topics is the method for recursively identifying significant n-grams in a sequence 
of words. We describe our novel solution to this problem, appropriate for the sparse settings that arise in topic 
models. 
A language model. Consider a corpus of $N$ words. Under an arbitrary language model, the log likelihood is

$$\log p(w) = \sum_{n=1}^{N} \log p(w_n \mid w_1, \ldots, w_{n-1}). \tag{1}$$

In parameterizing this model, we can consider two extremes. On one extreme, the fully parameterized 
model contains a conditional distribution of words given each possible history. However, this model is 
computationally intractable because it requires specifying $O(V^N)$ parameters. Moreover, it is statistically
unreliable; there will typically not be enough data to support maximum likelihood estimates. At the other
extreme, the unigram model posits that each word is independent of its history, $p(w_n \mid w_1, \ldots, w_{n-1}) = p(w_n)$.
This model is efficient to estimate, but cannot capture dependencies between words. 

We adopt a middle ground between these models that sparsely parameterizes the full model using word 
histories of varying lengths. For example, consider a model where all words follow a unigram distribution 
except for those words that follow the word "new." Among words following "new," some words, such as
"house," essentially follow their unigram distribution, while others, such as "york" or "jersey," are endowed
with "new"-specific probabilities.

Our challenge is to determine which n-grams, such as "new york," should be given special probabilities
p("new") p("york" | "new") and which should be modeled by products of their unigram probabilities. Then, to
visualize the distribution, we order by probability the collection of n-grams represented in the model. 

Evaluating the likelihood in equation (1) requires a distribution over words conditioned on an arbitrary history
of previous words. Denote a length $n$ history by $w_{1:n}$ and let $S_{w_{1:n}}$ be a set of words that are governed by
history-specific probabilities. To continue the example, if the history is "new" then $S_{\text{new}}$ might be {"york",
"jersey", "hampshire"}. The conditional distribution over words is

$$p(w_{n+1} \mid w_{1:n}) = \begin{cases} \pi_{w_{n+1} \mid w_{1:n}} & \text{if } w_{n+1} \in S_{w_{1:n}}, \\ \gamma_{w_{1:n}} \, p(w_{n+1} \mid w_{2:n}) & \text{otherwise.} \end{cases} \tag{2}$$

The constant $\gamma_{w_{1:n}}$ ensures that the distribution sums to one,

$$\gamma_{w_{1:n}} = \frac{1 - \sum_{v \in S_{w_{1:n}}} \pi_{v \mid w_{1:n}}}{1 - \sum_{v \in S_{w_{1:n}}} p(v \mid w_{2:n})}. \tag{3}$$

This is a back-off model with a sparsely represented set of conditional distributions. Note that if $S_{w_{1:n}} = \{\}$ is
empty, then $w_{1:n}$ is not endowed with a specific conditional distribution. In this case, $\gamma_{w_{1:n}} = 1$ and equation (2)
gives that $p(w \mid w_{1:n}) = p(w \mid w_{2:n})$. This type of model has been investigated in the speech recognition and
language modeling literature.
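
To make the back-off computation concrete, here is a minimal sketch of the conditional distribution. The dictionary layout (`unigram` for the unigram probabilities, `special` for the history-specific sets and their probabilities) and the toy numbers are assumptions chosen for illustration; the scaling constant is computed so that each conditional distribution sums to one.

```python
def backoff_prob(word, history, unigram, special):
    """p(word | history) under a sparse back-off model.

    unigram: dict mapping word -> unigram probability
    special: dict mapping a history tuple -> {word: history-specific probability}
    """
    history = tuple(history)
    if not history:
        return unigram[word]
    s = special.get(history, {})
    if word in s:
        return s[word]
    shorter = history[1:]  # drop the oldest word of the history
    # gamma rescales the backed-off probabilities so the total mass is one.
    kept = 1.0 - sum(s.values())
    backed = 1.0 - sum(backoff_prob(v, shorter, unigram, special) for v in s)
    gamma = kept / backed
    return gamma * backoff_prob(word, shorter, unigram, special)

unigram = {"new": 0.2, "york": 0.1, "jersey": 0.1, "house": 0.6}
special = {("new",): {"york": 0.5, "jersey": 0.3}}

p_house = backoff_prob("house", ("new",), unigram, special)
total = sum(backoff_prob(w, ("new",), unigram, special) for w in unigram)
```

In this toy example "york" and "jersey" take their "new"-specific probabilities, while "house" and "new" keep their unigram probabilities scaled by a common constant.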



Expanding the model with likelihood ratios. We now describe how we search through the space of 
sparsity patterns for the parameters of the language model. Finding a good set of parameters amounts to 
identifying the set $S_h$ for each word history $h$ in a corpus. Beginning with a unigram model, where $S_h = \{\}$
for all histories, our approach is to greedily determine the words best governed by bigram probabilities. We 
then apply this procedure recursively to find higher order dependencies. 

Consider a sparse bigram model, a model with word histories of at most one word, and consider a single pair
of words $u$ and $v$ such that $v \notin S_u$. To determine whether to add a bigram parameter $\pi_{v \mid u}$ to the model, we
compute the log likelihood ratio of the expanded model to the unexpanded model, i.e., the log of the probability
of the data under the expanded model divided by the probability of the data under the unexpanded model.
The expanded model is the model with $S_u \leftarrow S_u \cup \{v\}$ and conditional probabilities following equation (2). It
contains one new parameter $\pi^{\mathrm{new}}_{v \mid u}$, and a different estimate $\pi^{\mathrm{new}}_v$ of the unigram parameter $\pi_v$. All other parameters are
identical in both models.

For the likelihood ratio, the only probabilities that change between the two models are for the words that
follow $u$ and for all instances of $v$. The probability of instances of $v$ following $u$ changes from their unigram
probability $\pi_v$ to their bigram probability $\pi^{\mathrm{new}}_{v \mid u}$; the probability of other instances of $v$ changes from $\pi_v$ to $\pi^{\mathrm{new}}_v$.
Words following $u$ that are not in $S_u$ are governed by their unigram probabilities; however, they are scaled
differently in the expanded model. The scaling constant of equation (3) uses both $\pi^{\mathrm{new}}_{v \mid u}$ and $\pi^{\mathrm{new}}_v$.

Thus, the log likelihood ratio of the expanded model to the unexpanded model is 

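$$\begin{aligned}
LR_{uv} ={}& n_{uv} \log \pi^{\mathrm{new}}_{v \mid u} + (n_v - n_{uv}) \log \frac{\pi^{\mathrm{new}}_v}{\pi_v} \\
&+ \sum_{v' \notin S_u,\, v' \neq v} n_{uv'} \log\left(\pi_{v'} \gamma^{\mathrm{new}}_u\right) \\
&- n_{uv} \log \pi_v - \sum_{v' \notin S_u,\, v' \neq v} n_{uv'} \log\left(\pi_{v'} \gamma_u\right).
\end{aligned} \tag{4}$$

Here $n_{uv}$ is the number of times $v$ follows $u$ and $n_v$ is the number of occurrences of $v$.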

The first two lines are the log likelihood under the expanded model; the third line is the negative log likelihood
under the unexpanded model. All parameters are computed as normalized counts. A value of $LR_{uv}$ above
zero indicates that the expanded model fits the data better than the unigram model. This quantity is closely
related to an entropy-based score [22].
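
The log likelihood ratio can be computed from four counts in the simplest case, where the word u has no bigram parameters yet, so the unexpanded scaling constant equals one. The sketch below is an illustration under assumptions: the text says only that parameters are normalized counts, so the particular estimators used here, in particular the re-estimated unigram probability of v, are hypothetical choices.

```python
import math

def log_likelihood_ratio(n_uv, n_u, n_v, N):
    """Log likelihood ratio for adding the bigram (u, v) to a unigram model.

    n_uv: number of times v follows u
    n_u:  number of occurrences of u that are followed by some word
    n_v:  number of occurrences of v
    N:    total number of words
    Assumes the set of words with u-specific probabilities is empty beforehand.
    """
    pi_v = n_v / N                        # unexpanded unigram probability of v
    pi_v_given_u = n_uv / n_u             # new bigram parameter
    pi_v_new = (n_v - n_uv) / (N - n_u)   # re-estimated unigram probability (assumption)
    lr = 0.0
    if n_uv > 0:
        # Instances of v after u: unigram probability becomes bigram probability.
        lr += n_uv * (math.log(pi_v_given_u) - math.log(pi_v))
    if n_v > n_uv:
        # Other instances of v: unigram probability is re-estimated.
        lr += (n_v - n_uv) * (math.log(pi_v_new) - math.log(pi_v))
    if n_u > n_uv:
        # Other words after u keep their unigram probability, rescaled by the
        # new scaling constant (the old one is 1 for an empty set).
        gamma_new = (1.0 - pi_v_given_u) / (1.0 - pi_v_new)
        lr += (n_u - n_uv) * math.log(gamma_new)
    return lr

lr_independent = log_likelihood_ratio(65, 10_000, 6_500, 1_000_000)
lr_associated = log_likelihood_ratio(6_000, 10_000, 6_500, 1_000_000)
```

With these estimators the score is zero when v follows u exactly as often as its overall frequency predicts, and it is large and positive when v follows u far more often than that.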

For simplicity, we have described the likelihood ratio for expanding a unigram model to a sparse bigram 
model. This methodology can be generalized to word histories of arbitrary length, and the resulting likelihood 
ratios can be used to assess expansions beyond bigrams. In assessing such expansions, we compute the log
likelihood ratio of a model with expanded history $h \cup \{v\}$ to a model with history $h$. The unexpanded
model's back-off probabilities are for words given the original history.



Recursive permutation tests. With the likelihood ratio in hand, our next task is to develop an algorithm 
for building up a model from the unigram parameterization. Our algorithm adds parameters as needed by 
carrying out a sequence of hypothesis tests. 

Given a previous word $u$, consider the best candidate word that is not yet modeled as a bigram,

$$v^* = \arg\max_{v \notin S_u} LR_{uv}.$$

Suppose that $LR_{uv^*}$ is greater than zero. When is this significant, and when is it an artifact of a small data
sample?

We answer this question with a permutation test [16, 6]. The permutation test determines the significance
of a score, as in equation (4), that measures the degree of dependence between two random variables. To
decide whether a log likelihood ratio is significantly positive, we shuffle the data in such a way as to remove the added
dependence while retaining the previously modeled dependencies. We then compute the same log likelihood
ratio on the shuffled data. A score computed from shuffled data that is greater than the score we are testing is 
evidence that the true score is not significant, because it arose in a data set where any such dependency has 
been removed by design. Repeating this process, the proportion of scores under shuffled data greater than the 
quantity in question provides a p-value for its significance. 

Many hypothesis tests rely on assumptions about the asymptotic distribution of the test statistic. The 
permutation test relies on no such asymptotic assumption, and is thus particularly suited to sparse data settings. 
In natural language processing, permutation tests have been used for word collocations in multinomial 
models [15] and for bilingual associations [14]. They have not been developed for the back-off language
models that we consider. 
Returning to our task, recall that a positive likelihood ratio $LR_{uv^*}$ indicates that $u$ and $v^*$ are dependent in the
joint distribution. To perform the test, we repeatedly shuffle all the words (within and across documents), but
retain the sequences of words that are currently modeled. This removes dependencies between $u$ and terms
that are not in $S_u$, while retaining the other dependencies that are already assumed. In the permuted data, if
there is a $v$ such that $LR^{\mathrm{perm}}_{uv}$ is greater than $LR_{uv^*}$, then this is an indication that our likelihood ratio score
is not significant.

We perform the following procedure to find the set of words $S_u$ to endow with special $u$-specific
probabilities.

1. Set $v^* = \arg\max_{v \notin S_u} LR_{uv}$.

2. Sample $M$ permutations of the data that respect the current estimate of $S_u$. Compute

$$\text{p-value}_{uv^*} \approx \frac{1}{M} \left(\#\ \text{scores} > LR_{uv^*}\right).$$

3. If $\text{p-value}_{uv^*}$ is less than the desired threshold then add $v^*$ to $S_u$ and repeat.
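
One round of this procedure can be sketched as below. This is a simplified illustration, not the authors' code: it assumes no bigrams have been modeled yet for u, so a plain shuffle of all tokens respects the current model, and it reuses hypothetical count-based estimators for the score. Ties are counted against significance, which is a conservative choice made here.

```python
import math
import random
from collections import Counter

def lr(n_uv, n_u, n_v, N):
    """Log likelihood ratio for adding bigram (u, v); estimators are assumptions."""
    pi_v, pi_vu, pi_new = n_v / N, n_uv / n_u, (n_v - n_uv) / (N - n_u)
    score = n_uv * math.log(pi_vu / pi_v)
    if n_v > n_uv:
        score += (n_v - n_uv) * math.log(pi_new / pi_v)
    if n_u > n_uv and pi_new < 1.0:
        score += (n_u - n_uv) * math.log((1.0 - pi_vu) / (1.0 - pi_new))
    return score

def best_lr(tokens, u):
    """Return (score, v) for the follower v of u with the largest score."""
    unigrams = Counter(tokens)
    followers = Counter(b for a, b in zip(tokens, tokens[1:]) if a == u)
    n_u = sum(followers.values())
    best = (float("-inf"), None)
    for v, n_uv in followers.items():
        score = lr(n_uv, n_u, unigrams[v], len(tokens))
        if score > best[0]:
            best = (score, v)
    return best

def permutation_p_value(tokens, u, num_permutations=200, seed=0):
    """Estimate the p-value of the best expansion of u by shuffling the tokens."""
    rng = random.Random(seed)
    observed, v_star = best_lr(tokens, u)
    shuffled = list(tokens)
    exceed = 0
    for _ in range(num_permutations):
        rng.shuffle(shuffled)
        if best_lr(shuffled, u)[0] >= observed:
            exceed += 1
    return v_star, exceed / num_permutations

# Toy corpus: random filler with the collocation "new york" inserted regularly.
filler = random.Random(1)
tokens = []
for i in range(300):
    tokens.append(filler.choice("abcde"))
    if i % 10 == 0:
        tokens += ["new", "york"]

v_star, p_value = permutation_p_value(tokens, "new")
```

If the estimated p-value falls below the chosen threshold, `v_star` would be added to the set of words with u-specific probabilities and the test repeated; later rounds would need a shuffle that keeps the already modeled sequences intact, which this sketch does not implement.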



Intuitions and previous approaches. There are two primary differences between this approach and previous
approaches for finding phrases. First, by testing the maximum log likelihood ratio, we address the issue
of finding multiple related collocations rooted in a single word. If we simultaneously tested each possible
expansion, then the hypothesis tests would be dependent on each other. Traditional phrase finding algorithms
are based on a multinomial contingency table and subsequent hypothesis test [9]. They test all collocations
simultaneously, without accounting for the bias that this introduces.

This point can be made more concrete with our running example. Suppose "new" occurs 10,000 times in the
corpus, the word "york" follows it 6,000 times, and the word "jersey" follows it 3,000 times. Once we take
"york" into account, "new jersey" can occur at most 4,000 times. Setting aside the instances of
"new york," we see that 3,000 out of 4,000 is a strong signal of a bigram. Leaving in "new york," as we would if
we simultaneously tested both words, we would find that 3,000 out of 10,000 is not as strong a signal.
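
The arithmetic of this example is small enough to check directly. The variable names below are illustrative only; the counts are those given in the text.

```python
n_new, n_new_york, n_new_jersey = 10_000, 6_000, 3_000

# Tested jointly, the "new york" contexts stay in the denominator.
share_joint = n_new_jersey / n_new

# Tested after "new york" is modeled, those contexts are set aside.
share_sequential = n_new_jersey / (n_new - n_new_york)
```

Here `share_joint` is 0.3 and `share_sequential` is 0.75, which is why "new jersey" stands out much more clearly once "new york" has been accounted for.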

Second, and more importantly, our method is based on expanding the sparsity pattern of the parameters of the 
model in equation (2). When a significant word collocation is found, we expand the model to share fewer
parameters; but the resulting expanded model is still a valid model. We can thus apply the hypothesis test 
recursively. 

In traditional algorithms for finding collocations, there is no principled way to expand a model once a 
significant pair of words is found. For example, suppose a traditional algorithm finds "new york" to be a 
significant bigram. One can try to add "new york" to the vocabulary of the multinomial model. However, 
the resulting distribution, with probabilities for "new", "york" and "new york", will then have two ways of 
generating the bigram "new york." (One way is to generate the two words independently; the other is to 
generate the added bigram.) It is not clear how to account for observed text with such a distribution. 



4 Empirical Results 

We first compare the permutation test for back-off models with previous methods of finding significant 
bigrams. We then demonstrate the full procedure for visualizing topics. 
Figure 2: The F-measure for simulated corpora of four different sizes (powers of ten, measured in words) and for
three different p-values. Likelihood ratio test I is the method of [3]. Likelihood ratio test II is the back-off
model in Section 3 using the asymptotic distribution of the log likelihood ratio. Permutation test I is the
multinomial model of [15]. Permutation test II is the procedure of Section 3. All permutation tests used 1000
permutations. Methods relying on the asymptotic distribution of the test statistic perform better as more data
is seen. Methods that employ the permutation test perform well on all data set sizes, and perform better than
those methods relying on asymptotics. For this simulated data, the model of Section 3 performs as well as a
simple multinomial model. However, it further allows for finding multi-word expressions within a proper
language model. (See Figure 3.)



4.1 Simulated bigrams 

We evaluate our method on simulated text data with known bigrams. The data are drawn sequentially from a 
Chinese restaurant process (CRP) [17]. The CRP is a distribution over a potentially infinite vocabulary. To
simulate N words, we draw each from a distribution where the probability of any previously seen word is
proportional to the number of times it has been drawn, and the probability of a new word is proportional to a
scaling parameter $\alpha$. More formally, the $n$th word is drawn from the following distribution,

$$p(w_n = v \mid w_1, \ldots, w_{n-1}) \propto \begin{cases} n_v & \text{if } v \text{ exists;} \\ \alpha & \text{if } v \text{ is a new word.} \end{cases}$$

We embellish the CRP to create a corpus with bigrams. When a new term is to be added to the vocabulary, it
will be a collocation of two previously existing singleton terms with probability $\beta$. It will be a new singleton
term with probability $1 - \beta$. CRP-based distributions such as this one have been shown to match qualities of
word frequencies found in natural language.
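
A simulation of this embellished CRP might look as follows. This is a minimal sketch under assumptions that the text does not specify: terms are drawn one at a time (so a bigram contributes two words to the corpus), a proposed collocation that already exists is replaced by a new singleton, and the names `simulate_crp_corpus`, `alpha`, and `beta` are illustrative.

```python
import random

def simulate_crp_corpus(num_draws, alpha, beta, seed=0):
    """Simulate a corpus whose vocabulary contains singletons and true bigrams.

    Returns the flat word sequence and the set of true bigrams.
    """
    rng = random.Random(seed)
    counts = {}           # term -> number of times drawn
    singletons = []       # one-word terms created so far
    true_bigrams = set()  # two-word terms created so far
    words = []
    for n in range(num_draws):
        if rng.random() < alpha / (alpha + n):
            # A new term joins the vocabulary.
            term = None
            if len(singletons) >= 2 and rng.random() < beta:
                pair = tuple(rng.sample(singletons, 2))
                if pair not in counts:
                    term = pair
                    true_bigrams.add(pair)
            if term is None:
                term = "w%d" % len(singletons)
                singletons.append(term)
            counts[term] = 1
        else:
            # An existing term is drawn in proportion to its count.
            terms = list(counts)
            term = rng.choices(terms, weights=[counts[t] for t in terms])[0]
            counts[term] += 1
        # A bigram is written out as two ordinary words, so it cannot be
        # told apart from a pair of singletons in the output.
        words.extend(term if isinstance(term, tuple) else [term])
    return words, true_bigrams

words, true_bigrams = simulate_crp_corpus(2000, alpha=50, beta=0.3, seed=1)
```

The returned `true_bigrams` set plays the role of ground truth when scoring a collocation finder on `words`.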

Simulating from this process yields a random corpus with a vocabulary containing both singletons and bigrams. 
However, an observed bigram is indistinguishable from a pair of singletons. Thus, using this corpus as input 
to a word collocation algorithm, we can compare the set of found bigrams to the true set of bigrams. We 
measure success with weighted precision and recall. Note that in these simulations we did not find or produce 
phrases of more than two words, as our purpose is to compare to previous techniques. 
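
The comparison between found and true bigrams can be sketched with set operations. The text reports weighted precision and recall but does not give the weighting, so the version below is unweighted and should be read as an assumption; the example bigrams are invented for illustration.

```python
def precision_recall_f(found, truth):
    """Unweighted precision, recall, and F-measure for sets of bigrams."""
    found, truth = set(found), set(truth)
    hits = len(found & truth)
    precision = hits / len(found) if found else 0.0
    recall = hits / len(truth) if truth else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f_measure

found = {("new", "york"), ("new", "house"), ("los", "angeles")}
truth = {("new", "york"), ("los", "angeles"), ("new", "jersey"), ("san", "jose")}
precision, recall, f_measure = precision_recall_f(found, truth)
```

In this toy case two of the three found bigrams are correct and two of the four true bigrams are recovered.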

We compared several tests for bigram discovery. 

The $\chi^2$ test. This is a classical test of independence for discrete variables.
Figure 3: Standard unigram display of topics compared with turbo topics for two corpora, the Huffington Post
(left), and physics arXiv (right). Four topics are shown for each corpus, comparing the unigram visualization 
(bottom) with the turbo topic visualization (top). The presentations that include the n-grams are more 
descriptive, uncovering n-grams such as "Indiana jones" and "the California supreme court" in the case of the 
Huffington Post, and "monte carlo simulations" and "chiral symmetry breaking" in the case of the physics 
abstracts. 



Likelihood ratio tests. Here, we obtain p-values from the asymptotic distribution of twice the log likelihood
ratio. We implemented a simple multinomial model [3] and the back-off model in Section 3.

Permutation tests. As described above, these yield a distribution-free method of obtaining p-values. We
employed a permutation test with both a simple multinomial model [15] and the back-off model from Section 3.
(A comparative study was not performed in [15].)

Figure 2 illustrates the F-measure achieved by these tests for four simulated corpora of different sizes and for
three p-value thresholds. The corpora were created with parameters $\alpha = 1000$ and $\beta = 0.1$. The tests that rely
on asymptotics improve performance as the corpus size increases. Permutation tests perform well on small
and large corpora. The simple multinomial model is as effective as the model of Section 3 for this task, but
the procedure presented here allows for recursive detection of word phrases.

4.2 Example topic visualizations 

We demonstrate turbo topics with two corpora.¹ First, we find topics from each corpus using variational
expectation maximization [1]. We then restrict our attention to contexts surrounding the words assigned to
each topic, and use the recursive permutation tests of Section 3 to find a set of back-off models, one for each
topic.

¹ Code can be found at http://www.cs.princeton.edu/~blei/

To visualize, we consider the significant phrases of each model, i.e., those n-grams for which the algorithm
chose to explicitly represent parameters. We order the n-grams by their probabilities. To aid in visualization,
when an n-gram subsumes a shorter n-gram with lower probability, we incorporate the shorter n-gram's mass
into the longer n-gram's probability. (The shorter n-gram's probability is determined by how often it occurs
without being part of the longer «-gram.) For example, if an expanded topic contains "The New York Mets" 
with high probability and "New York Mets" with lower probability, then we add the probability of the shorter 
phrase to the longer phrase. If a shorter phrase, such as "court," appears on its own with higher probability 
than a longer phrase, such as "supreme court," then both are considered. 
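
The merging rule in this paragraph can be sketched as follows. Two details are assumptions made for the illustration: n-grams are represented as tuples of words, and when several longer phrases subsume a shorter one, its mass goes to the most probable of them. The probabilities in the example are invented.

```python
def is_subphrase(short, long):
    """True if `short` occurs as a contiguous, strictly shorter piece of `long`."""
    n, m = len(short), len(long)
    return n < m and any(long[i:i + n] == short for i in range(m - n + 1))

def merge_subsumed(ngram_probs):
    """Fold each shorter n-gram into a more probable n-gram that subsumes it."""
    merged = dict(ngram_probs)
    # Visit short phrases first so that mass can cascade toward longer ones.
    for short in sorted(ngram_probs, key=len):
        hosts = [g for g in merged
                 if is_subphrase(short, g) and merged[g] > merged[short]]
        if hosts:
            target = max(hosts, key=lambda g: merged[g])
            merged[target] += merged.pop(short)
    return merged

probs = {
    ("the", "new", "york", "mets"): 0.05,
    ("new", "york", "mets"): 0.02,
    ("court",): 0.10,
    ("supreme", "court"): 0.04,
}
merged = merge_subsumed(probs)
ranked = sorted(merged, key=merged.get, reverse=True)
```

In the example, "new york mets" is absorbed by "the new york mets," while "court" and "supreme court" both remain because the shorter phrase is the more probable one.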

Our first corpus contains articles from the Huffington Post, an online news service. This corpus contains 4000 
documents and has a vocabulary of 6500 terms. Second, we use the 2006 physics abstracts from arXiv.org, an 
online scientific preprint service. This corpus contains 50,000 documents and has a vocabulary of 17,000 
terms. For both corpora, the vocabulary was obtained by removing stop words and infrequent terms (appearing 
in fewer than 20 documents) from the topic analysis. These terms were, however, considered in the phrase 
analysis. 

Figure 3 shows, for each collection, four of the original topics and the corresponding turbo topics. (The
news model contains 100 topics; the physics model contains 200 topics. For both corpora, the number of
permutations was 100 and the p-value threshold was 0.01.) Under the expanded view with bigrams and longer
phrases, what the topic is "about" comes into sharper focus. For instance, while we see from the ordered 
list of unigrams that a topic is about movies, it may not be immediately apparent what "jones" and "city" 
refer to. In the expanded visualization, the phrases "Indiana jones" and "sex in the city" provide a clearer 
indication why these terms appear with such high probability. Similarly, in a topic that concerns gay marriage, 
"the California supreme court" appears, offering a refinement of the terms "court" and "supreme" which are 
separated in the standard probability-sorted unigram list. Similar effects are seen in the topics extracted from 
the corpus of physics abstracts. While one of the unigram topics assigns high weight to "black" and "holes," 
the phrases "black hole mass," "star formation," and "supermassive black holes" are more suggestive of how 
the topic is used. 

5 Discussion 

Topic models are formulated and estimated based on the assumption of word-level exchangeability, which 
leads to relatively simple and computationally efficient inference algorithms. This "bag of words" assumption 
is reasonable for identifying topics, but it becomes a handicap when interpreting them. The salient bigrams 
and phrases for a topical word provide an indication of the role it plays in the topic. 

We have developed a new procedure for determining the salient phrases for a topic. Our procedure preserves 
the simplicity of an exchangeable model while incorporating some of the context of richer models. Though 
we focused on topics derived from LDA, we emphasize that this procedure can be used for any topic model, 
provided there is a latent topic assignment variable for each word of the corpus. Other examples include author-topic
models [19] and conditional topic models [12]. Moreover, although we focused on topic visualization,
one can imagine other uses of the resulting phrases, e.g., for information retrieval, that would not require a 
full generative model. 

In terms of statistical methodology, our results demonstrate that the use of permutation testing is appropriate 
and effective in this setting, where sparse statistics render tests based on asymptotic distributions less accurate. 
Although we have implemented a simple greedy strategy based on recursively applying the permutation test,
it would be of interest to investigate more computationally efficient procedures for simultaneously testing and
correcting for multiple hypotheses for expanding word phrases.